AITopics | risk score

Collaborating Authors

risk score

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Auditing for Human Expertise

Neural Information Processing SystemsFeb-18-2026, 02:21:36 GMT

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained

artificial intelligence, data mining, machine learning, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Government Relations & Public Policy (1.00)
Health & Medicine > Therapeutic Area > Gastroenterology (0.68)
Health & Medicine > Diagnostic Medicine (0.67)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

Add feedback

Appendix A Additional results This appendix section shows additional results and corresponding plots to support the insights

Neural Information Processing SystemsFeb-17-2026, 12:24:05 GMT

Section A.2 shows results using a chat-style verbalized numeric Section A.3 shows results on four extra benchmark tasks made available with Finally, Section A.5 presents and discusses results on feature In this section, we evaluate risk score calibration on the income prediction task across different subpopulations, such as typically done as part of a fairness audit. Figures A1-A2 show group-conditional calibration curves for all models on the ACSIncome task, evaluated on three subgroups specified by the race attribute in the ACS data. We show the three race categories with largest representation. The'Mixtral 8x22B' and'Yi 34B' models shown are the worst offenders, where samples belonging to the'Black' population see consistently lower scores for the same positive label probability when compared to the'Asian' or'White' populations. On average, the'Mixtral 8x22B (it)' model classifies a Black individual with a In fact, this score bias can be reversed for some base models, overestimating scores from Black individuals compared with other subgroups.

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Oceania > New Zealand (0.04)
North America > United States > California (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Add feedback

b0a4b3e384b4554e65a47ad1f6b0310a-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-17-2026, 12:24:02 GMT

data mining, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
North America > United States > California (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.92)
Questionnaire & Opinion Survey (0.68)

Industry:

Government (0.92)
Education (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.98)
(2 more...)

Add feedback

73e0f7487b8e5297182c5a711d20bf26-Paper.pdf

Neural Information Processing SystemsFeb-12-2026, 14:41:14 GMT

disparity, risk score, xauc, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Missouri > Jackson County > Kansas City (0.04)
North America > United States > California (0.04)
North America > Canada (0.04)

Genre: Research Report (0.68)

Industry:

Law (1.00)
Banking & Finance (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.68)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Data Science > Data Mining (0.95)

Add feedback

1f3cbee17170c3ffff3e413d2df54f6b-Paper-Conference.pdf

Neural Information Processing SystemsFeb-8-2026, 18:06:54 GMT

data mining, machine learning, prediction, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Iowa > Johnson County > Iowa City (0.14)
North America > United States > Texas > Brazos County > College Station (0.14)
North America > United States > Illinois > Cook County > Chicago (0.05)
(2 more...)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Transportation > Ground > Road (0.93)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Data Science > Data Mining (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

FasterRisk: Fast and Accurate Interpretable Risk Scores

Neural Information Processing SystemsFeb-5-2026, 07:14:16 GMT

Over the last century, risk scores have been the most popular form of predictive model used in healthcare and criminal justice. Risk scores are sparse linear models with integer coefficients; often these models can be memorized or placed on an index card. Typically, risk scores have been created either without data or by rounding logistic regression coefficients, but these methods do not reliably produce high-quality risk scores. Recent work used mathematical programming, which is computationally slow. We introduce an approach for efficiently producing a collection of high-quality risk scores learned from data. Specifically, our approach produces a pool of almost-optimal sparse continuous solutions, each with a different support set, using a beam-search algorithm. Each of these continuous solutions is transformed into a separate risk score through a star ray search, where a range of multipliers are considered before rounding the coefficients sequentially to maintain low logistic loss. Our algorithm returns all of these high-quality risk scores for the user to consider. This method completes within minutes and can be valuable in a broad variety of applications.

artificial intelligence, machine learning, risk score, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.60)

Add feedback

Evaluating language models as risk scores

Neural Information Processing SystemsDec-26-2025, 23:34:58 GMT

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks.Conditioned on a question and answer-key, does the most likely token match the ground truth?Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty.In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks.We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products.A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks.We evaluate 17 recent LLMs across five proposed benchmark tasks.We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated.Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores.In fact, instruction-tuning polarizes answer distribution regardless of true underlying data uncertainty.This reveals a general inability of instruction-tuned models to express data uncertainty using multiple-choice answers.A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models.These differences in ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind-spot in the current evaluation ecosystem that folktexts covers.

artificial intelligence, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the XAUC Metric

Neural Information Processing SystemsDec-25-2025, 14:06:08 GMT

Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. We introduce the xAUC disparity as a metric to assess the disparate impact of risk scores and define it as the difference in the probabilities of ranking a random positive example from one protected group above a negative one from another group and vice versa. We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive performance.

bipartite ranking, classification, risk score, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Yan, Yuping, Xie, Yuhan, Li, Yuanshuai, Yu, Yingchao, Lyu, Lingjuan, Jin, Yaochu

arXiv.org Artificial IntelligenceDec-12-2025

Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. T o ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs. W arning: This paper may contain some offensive content in data and model outputs.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.10287

Country: Europe > Belgium (0.04)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models III: Implementing the Bacterial Biothreat Benchmark (B3) Dataset

Ackerman, Gary, Wilson, Theodore, Kallenborn, Zachary, Shoemaker, Olivia, Wetzel, Anna, Peterson, Hayley, Danfora, Abigail, LaTourette, Jenna, Behlendorf, Brandon, Clifford, Douglas

arXiv.org Artificial IntelligenceDec-10-2025

The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework, with previous papers detailing the development of the B3 dataset. The pilot involved running the benchmarks through a sample frontier AI model, followed by human evaluation of model responses, and an applied risk analysis of the results along several dimensions. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by a LLM, identifying the key sources of that risk and providing guidance for priority areas of mitigation priority.

benchmark, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2512.08459

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Government > Military (0.89)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.76)

Add feedback